1. Introduction

In the context of software, what is traditionally 
called "integration" is the engineering process that cre- 
ates or improves information flows between informa- 
tion systems designed for different purposes. What 
actually flows between the systems is data, but what is 
critical to the business process is that all of the right 
data flows in the right form for the receiving system, 
and that the receiving system and the people who use it
interpret the data correctly.

The term "conceptual integrity" was popularized in 
Ref. [1] to refer to a kind of consistency in system 
architecture that allows the system to become a cohe- 
sive, sensible whole. A similar kind of conceptual 
integrity is required for the result of data integration to 
be cohesive and sensible. Compromised conceptual 
integrity results in "semantic faults," which are com- 
monly blamed for latent integration bugs. 

Most technical approaches to data integration fall
squarely into one of two categories. There is the "global
schema" category, where every schema is mapped
into a common reference schema, and there is the direct
translation category, where schemata are mapped
directly to one another in a point-to-point fashion. Each
category has widely recognized advantages and disadvantages.
Among these is the efficiency argument in
favor of standardization (i.e., having a standard global
schema): to link $n$ different systems directly requires
$n^2 - n$ one-directional mappings, but to link them via a
global schema requires only $2n$.
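The efficiency argument can be checked numerically. The following is a minimal sketch, assuming $n^2 - n$ one-directional mappings for direct translation and $2n$ for a global schema (one mapping into and one out of the global schema per system); the function names are illustrative only.

```python
def direct_mappings(n: int) -> int:
    """One-directional point-to-point mappings among n systems: n^2 - n."""
    return n * n - n


def global_schema_mappings(n: int) -> int:
    """Mappings to and from a single global schema: 2n."""
    return 2 * n


# Compare the two counts for a few system counts.
for n in (2, 3, 5, 10):
    print(n, direct_mappings(n), global_schema_mappings(n))
```

Under these assumptions the two counts are equal at n = 3, and the global schema requires fewer mappings for every larger n.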

It is sometimes claimed that direct translation allows 
for better conceptual integrity on a technical level 
(ignoring the human factors of dealing with $n^2 - n$ different
translations) because one can translate only what
is necessary for communication and ignore anything 
that is conflicting but irrelevant. However, after discus- 
sion of this point in the Automated Methods for 
Integrating Systems [2] project, it was realized that 
such a translation implies a certain "integration 
schema" which, regardless of whether it is written 
down or only in the mind of the integrator, is neverthe- 
less equivalent in its impact on conceptual integrity to 
having used a global schema. 
With that perspective, it should be possible to create
an abstract model of conceptual integrity that is inde- 
pendent of the technical approach that is chosen for 
data integration. This paper documents such a model. 
The goal is not to provide another method for maintain- 
ing conceptual integrity but to provide a logical model 
of conceptual integrity itself, capable of describing 
both correct and incorrect integrations resulting from 
whichever methodology is employed. 



2. Related Work 

Reference [3] contains a model that is similar in 
approach yet foundationally different from the one in 
this paper. Modal logic and logical properties are used 
to build a detailed model of identity and the ramifica- 
tions for correct and incorrect subsumption relation- 
ships are examined. Possible analogies to Ref. [3] are 
discussed in Sec. 7. 

An alternate view on the issues discussed in this 
paper can be found in work having to do with context 
logic, which traces back to McCarthy [4,5]. A concise 
discussion of the application of context logic to infor- 
mation integration is given in Ref. [6]; see also the 
"Integrating Databases" example in Ref. [7], multicon- 
text (MC) systems as described in Ref. [8], and the 
context-based schema analysis in Ref. [9]. The relation- 
ship between the view of this paper and the context 
logic view is explored in more detail in Sec. 8. 

Logic-based approaches to schema integration, e.g., 
Refs. [10] and [11], are constructive methods intended 
to maintain conceptual integrity; i.e., they assume that 
one works within the method when integrating schema- 
ta and that the intensional definitions constructed by the 
modeler are complete and logically sufficient. A loss of 
conceptual integrity within the model would be indicat- 
ed by the presence of a logical contradiction, which 
would render any subsequent logical inferences mean- 
ingless. Consequently these logic-based approaches to 
schema integration are not ideal for describing and ana- 
lyzing potentially imperfect integrations resulting from 
other methodologies. 

The views of class and abstraction in this paper par- 
tially reflect ideas appearing in Ref. [12]. 



3. Logical Notation

Belief and time are critical to integration. Integration
is performed at a point in time and from a point of view.
Appropriately, this paper uses symbols from temporal
modal logic as well as a "doxastic" modal (pertaining
to belief).

The following descriptions are quoted from Ref. [13].
$\Box$   It is necessary that . . .
$\Diamond$   It is possible that . . .
$G$   It will always be the case that . . .
$F$   It will be the case that . . .
$H$   It has always been the case that . . .
$P$   It was the case that . . .
$B_x$   $x$ believes that . . .

$\wedge$, $\vee$, and $\sim$ are the conjunction, disjunction, and
negation operators of classical logic. $\equiv$ represents logical
equivalence; i.e., the left hand side and the right
hand side necessarily have the same truth values.

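Let $p$, $q$, and $r$ represent arbitrary logical sentences.
The modals $\Box$ and $\Diamond$ relate to each other as follows.

$$\Box p \equiv \sim\Diamond\sim p \tag{1}$$

$$\Diamond p \equiv \sim\Box\sim p \tag{2}$$

$$\Box\sim p \equiv \sim\Diamond p \tag{3}$$

$$\Diamond\sim p \equiv \sim\Box p \tag{4}$$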

The temporal modals have similar relations. 

$$Fp \equiv \sim G\sim p \tag{5}$$

$$Pp \equiv \sim H\sim p \tag{6}$$

A distinction is made between material implication,
represented by the symbol $\supset$, and strict implication,
represented by $\rightarrow$.

Material implication is the truth-functional connec- 
tive of classical logic. 



$$p \supset q \equiv \sim p \vee q \tag{7}$$



Strict implication expresses the stronger statement 
that the consequent necessarily follows from the 
antecedent (i.e., is logically entailed or true by defini- 
tion) [14]. 



$$p \rightarrow q \equiv \Box(p \supset q) \tag{8}$$



Strict implication must not be confused with relevant 
implication as used in relevance logics [15] or other- 
wise conflated with "relevance." Relevance is not 
required. It is acceptable (albeit unhelpful) that a tautol- 
ogy (a necessarily true sentence) is strictly implied by 
any sentence whatsoever. 

$$\Box q \rightarrow (p \rightarrow q) \tag{9}$$



The need to distinguish between material implication 
and strict implication arises here because of time. As 
new individuals are created, strict implications that are 
true remain true, but the truth values of material impli- 
cations can change. For example: over time, more peo- 
ple will be born; by definition, they will all be mortal 
(i.e., being a person strictly implies mortality); howev- 
er, the fact that all people live on Earth might not 
remain true (even though "being a person materially 
implies living on Earth" holds at the present time). If 
the universe of discourse were static, the distinction 
would be moot: if an implication happened to be true 
for the universe as it was, then it would suffice for all 
discussions about that universe. 
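The difference between the two implications can be illustrated with a small finite model. This is only a sketch: necessity is approximated as truth in every snapshot of the universe that the model lists, and the property names and the hypothetical off-world birth are assumptions made for illustration.

```python
from typing import Callable, Dict, List

Individual = Dict[str, bool]
World = List[Individual]
Predicate = Callable[[Individual], bool]


def material(world: World, p: Predicate, q: Predicate) -> bool:
    """p materially implies q in one world: no individual has p without q."""
    return all((not p(x)) or q(x) for x in world)


def strict(worlds: List[World], p: Predicate, q: Predicate) -> bool:
    """p strictly implies q: the material implication holds in every listed world."""
    return all(material(w, p, q) for w in worlds)


is_person = lambda x: x["person"]
is_mortal = lambda x: x["mortal"]
on_earth = lambda x: x["on_earth"]

# Snapshot of the universe now, and a later one with a new individual.
now = [{"person": True, "mortal": True, "on_earth": True}]
later = now + [{"person": True, "mortal": True, "on_earth": False}]

print(material(now, is_person, on_earth))          # True
print(material(later, is_person, on_earth))        # False
print(strict([now, later], is_person, is_mortal))  # True
```

The material implication "person implies on Earth" changes truth value as the population grows, while "person implies mortal" holds in every snapshot considered.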

The reader is encouraged to consult Refs. [13], [14], 
and [16] regarding the spectrum of modal and temporal 
logics that are distinguished by the axioms accepted. 
Reference [16] identifies a series of systems that build 
on the following axioms (paraphrased): 
(SL = Sentential Logic) Every theorem of classic sentential
logic is a theorem.

(MP = Modus Ponens) If $p$ is a theorem, and $p \supset q$ is a
theorem, then $q$ is a theorem.

(Nec = "rule of necessitation") If $p$ is a theorem, then
$\Box p$ is a theorem.

($\Diamond$) $\Diamond p$ is defined as $\sim\Box\sim p$.

The above axioms are accepted here. System T is 
formed by adding the following two axioms, which are 
also accepted here. 



$$\Box(p \supset q) \supset (\Box p \supset \Box q) \tag{10}$$

$$\Box p \supset p \tag{11}$$


It follows that 





$$p \supset \Diamond p \tag{12}$$
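One way to see why Eq. (12) follows is sketched below, using only Eq. (11), sentential logic, and the definition of $\Diamond$; the layout of the steps is an illustration and is not taken from the cited references.

```latex
\begin{align*}
\Box\sim p &\supset\ \sim p      && \text{Eq. (11) with } \sim p \text{ substituted for } p \\
p          &\supset\ \sim\Box\sim p && \text{contraposition (SL)} \\
p          &\supset \Diamond p   && \text{definition of } \Diamond \text{, Eq. (2)}
\end{align*}
```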



In addition, the following axioms are accepted.
(N.B., since these are theorems, then by the rule of
necessitation and the definition of strict implication,
their counterparts using strict implication are also theorems.)

$$Gp \supset Fp \tag{13}$$

$$Hp \supset Pp \tag{14}$$

$$p \supset GPp \tag{15}$$

$$p \supset HFp \tag{16}$$

$$\Box p \supset G\Box p \wedge H\Box p \tag{17}$$

$$G(p \supset q) \supset (Gp \supset Gq) \tag{18}$$

$$H(p \supset q) \supset (Hp \supset Hq) \tag{19}$$



4. Model 
4.1 Foundation 

A schema is a set of identified collections or group- 
ings. Those collections would be called classes in an 
object-oriented system, tables in a relational system, 
concepts in a knowledge-based system, etc. For read- 
ability, the word "class" will be used for a collection or 
grouping, and the word "individual" will be used for 
that which is grouped (instance, tuple, etc.). 

Let $\alpha$, $\beta$, $\gamma$, and $\delta$ range over classes, let $a$ range over
individuals, and let $A$ range over properties. A Boolean
model of properties is assumed. $Aa$ is true if and only if
individual $a$ has the property $A$. Define $\bar{A}$ to be the
negation of $A$.

$$\bar{A}a \equiv \sim Aa \tag{20}$$

$$Aa \vee \bar{A}a \tag{21}$$

$$\sim(Aa \wedge \bar{A}a) \tag{22}$$



Define $N(\alpha)$ as the set of properties that are necessary
for membership in $\alpha$.

$$A \in N(\alpha) \equiv a \in \alpha \rightarrow Aa \tag{23}$$



Define $O(\alpha)$ as the set of properties that are possible
for (consistent with) membership in $\alpha$.

$$A \in O(\alpha) \equiv \Diamond(a \in \alpha \wedge Aa) \tag{24}$$



It is assumed that one will abstain from defining 
classes that are necessarily empty (also known as 
"incoherent" classes). 

These more intuitive theorems about N and O then 
follow: 



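$$a \in \alpha \wedge A \in N(\alpha) \rightarrow Aa \tag{25}$$

$$A \in N(\alpha) \rightarrow \bar{A} \notin N(\alpha) \tag{26}$$

$$a \in \alpha \wedge Aa \rightarrow A \in O(\alpha) \tag{27}$$

$$A \in O(\alpha) \rightarrow \bar{A} \notin N(\alpha) \tag{28}$$

$$A \notin O(\alpha) \rightarrow \bar{A} \in N(\alpha) \tag{29}$$

$$A \in N(\alpha) \rightarrow A \in O(\alpha) \wedge \bar{A} \notin O(\alpha) \tag{30}$$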

It is possible for both a property and its negation to 
appear in O, with neither appearing in N. 

As with $\Box$ and $\Diamond$, it would suffice to have only N or
only O, but having both allows for more intuitive
formulations.
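The definitions of N and O can be made concrete over a finite set of possible individuals. The following is a sketch under stated assumptions: possibility is approximated by enumerating a small, hypothetical list of possible individuals, a property is a (name, polarity) pair with polarity False standing for the negated property, and the class and property names are invented for illustration.

```python
from itertools import product

# Each record is one *possible* individual: its classes and the truth value
# of every (hypothetical) property name.
POSSIBLE = [
    {"classes": {"Dog"}, "props": {"mammal": True, "brown": True}},
    {"classes": {"Dog"}, "props": {"mammal": True, "brown": False}},
    {"classes": {"Cat"}, "props": {"mammal": True, "brown": True}},
]
NAMES = ["mammal", "brown"]


def holds(ind, prop):
    """True if the individual has the property; prop is (name, polarity)."""
    name, polarity = prop
    return ind["props"][name] == polarity


def members(cls):
    """All possible individuals that belong to the class."""
    return [i for i in POSSIBLE if cls in i["classes"]]


def N(cls):
    """Properties that every possible member has (cf. Eq. 23)."""
    return {p for p in product(NAMES, (True, False))
            if all(holds(i, p) for i in members(cls))}


def O(cls):
    """Properties that at least one possible member has (cf. Eq. 24)."""
    return {p for p in product(NAMES, (True, False))
            if any(holds(i, p) for i in members(cls))}


def neg(prop):
    """The negation of a property."""
    return (prop[0], not prop[1])


print(sorted(N("Dog")))
print(sorted(O("Dog")))
# Check of Eq. (30) for this class: necessary implies possible, negation not possible.
print(all(p in O("Dog") and neg(p) not in O("Dog") for p in N("Dog")))
```

In this toy model both "brown" and its negation appear in O for the class Dog while neither appears in N, which is the situation described in the text.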

Importantly, it is assumed that membership in classes
is primitive. N contains properties that are necessary
for membership in a class [as stated in Eq. (25), membership
in a class does strictly imply that the individual
has the necessary properties], but they are not logically
sufficient to determine the membership. Classes are not
necessarily characterized by a set of properties: in general,
$A \in N(\alpha) \supset Aa$ does not strictly imply $a \in \alpha$.
Ideally the "intent" of a class would be reflected by the
properties in N, but the extent (its membership) is what
is assumed to be known.

4.2 Subsumption 

N and O display a symmetry with respect to sub- 
sumption. 

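$$(a \in \gamma \rightarrow a \in \delta) \wedge A \in N(\delta) \rightarrow A \in N(\gamma) \tag{31}$$

$$(a \in \gamma \rightarrow a \in \delta) \wedge A \in O(\gamma) \rightarrow A \in O(\delta) \tag{32}$$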

It follows from Eqs. (30) and (32) that a property that 
is necessary in a subclass must be consistent with the 
superclass: 



$$(a \in \gamma \rightarrow a \in \delta) \wedge A \in N(\gamma) \rightarrow A \in O(\delta) \tag{33}$$
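The step from Eqs. (30) and (32) to Eq. (33) can be spelled out as follows. This is a sketch of the reasoning, with $\gamma$ the subclass and $\delta$ the superclass; the numbering of the steps is illustrative.

```latex
\begin{align*}
&1.\ (a \in \gamma \rightarrow a \in \delta) \wedge A \in N(\gamma)
    && \text{assumption} \\
&2.\ A \in O(\gamma)
    && \text{from 1 by Eq. (30)} \\
&3.\ (a \in \gamma \rightarrow a \in \delta) \wedge A \in O(\gamma)
    && \text{from 1 and 2} \\
&4.\ A \in O(\delta)
    && \text{from 3 by Eq. (32)}
\end{align*}
```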



It is not always obvious that defining a subclass has 
ramifications for the meaning of the superclass, but it is 
true nonetheless. If someone defines a subclass Six- 
Legged-Dog, and the subclass is not necessarily empty, 
it follows that having six legs is consistent with being a 
dog. This may greatly surprise the person who defined 
Dog originally but such is the kind of detail that one 
needs to know in order to perform a correct integration. 



4.3 Conceptual Integrity 

Let S and T represent different schemata (e.g., the 
data models implemented in two separate software sys- 
tems). S and T do not share individuals or classes; how- 
ever, they are discussed in terms of the same properties, 
all of which are within the same logical context. 

For $\alpha$ of $S$ and $\beta$ of $T$, the simplest form of integration
is a partial "instance map" from members of $\alpha$ to
members of $\beta$. Let $M(a)$ for $a \in \alpha$ represent the analog
of $a$ (its image under $M$, if such exists) in $\beta$. To maintain
conceptual integrity, the following condition must
hold for all $A$:

$$a \in \alpha \wedge M(a) \in \beta \wedge Aa \supset A \in O(\beta) \tag{34}$$



To paraphrase: if an individual with a given property
is mapped to an analog in $\beta$, then that property must be
consistent with membership in $\beta$. It is not necessary
that the analog possess that property if the negation of
that property is also in $O(\beta)$; nor is it necessary that
every individual in $\alpha$ have an analog.
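The condition of Eq. (34) can be checked mechanically once the properties of the mapped individuals and the set $O(\beta)$ are written down. The sketch below assumes a finite, explicitly listed set of properties; the function name, the property labels, and the sample data are hypothetical.

```python
def integrity_violations(alpha, mapping, o_beta):
    """Return (individual, property) pairs that violate the condition of Eq. (34).

    alpha   : dict from each member of the source class to the set of
              properties it has (negated properties are listed explicitly)
    mapping : partial instance map M, a dict from members of alpha to analogs
    o_beta  : set of properties consistent with membership in the target class
    """
    return sorted(
        (a, prop)
        for a, props in alpha.items()
        if a in mapping              # unmapped individuals are not constrained
        for prop in props
        if prop not in o_beta        # property must be consistent with beta
    )


alpha = {
    "Acme Boxes": {"unique_name"},
    "John Q. Fictional": {"not_unique_name"},
}
o_beta = {"unique_name"}

full_map = {"Acme Boxes": "b1", "John Q. Fictional": "b2"}
partial_map = {"Acme Boxes": "b1"}

print(integrity_violations(alpha, full_map, o_beta))
print(integrity_violations(alpha, partial_map, o_beta))
```

The partial map produces no violation because the condition constrains only those individuals that actually have an analog.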



5. On Abstraction 

Data models as we know them are abstractions, and 
so are the mental models of the people who construct 
them. By definition, an abstraction of a thing or event 
is not identical to the thing or event itself and does not 
have all of its properties. Moreover, any documented 
model is at best an approximate expression of a mental 
model [17], and different data modelers think about dif- 
ferent properties even when they believe that they are 
modeling the same thing. These differences can lead to 
a wide variety of conflicts [9,18,19]. 

Every thing or event has an unbounded set of prop- 
erties. A data modeler tries to settle on a finite set of 
properties that suffices for a particular application. But 
when two applications are integrated, the properties 
that were captured in documented models may no 
longer suffice. 

Consider the acquisition, by a leading manufacturer
of 100 % recycled content corrugated boxes, of a relatively
obscure company that makes biodegradable bubble
wrap. The box company has a technically superior
customer database, but the bubble wrap company has
some specialized applications integrated with its own
database that would be expensive to change. So it is
decided to use the box company's database as the primary
one and just replicate the data in the other database
for the sake of the specialized applications. This
seems to work, and environmentally conscientious
mail-order operations the world over rejoice that they
can now obtain boxes and bubble wrap through the
far-reaching distribution network of the former box company.
Then disaster strikes. The box company's best customer,
the John Q. Fictional Company of Hanover, calls
to complain that the bubble wrap they ordered never
arrived. Investigation reveals that the order in question
was shipped to the John Q. Fictional Company of
Anchorage, a new customer who had simply ordered a
small number of corrugated boxes. It turns out that one
of the bubble wrap applications was written to key by
company name, so it retrieved the wrong John Q.
Fictional Company from the merged database and
propagated the error.

It is important to understand that the box company's 
database was not "wrong" to allow two companies to 
use the same name. Prior to the integration, it made no 
difference. The box company's applications did not rely 
on names being unique. Neither was it "wrong" for the 
bubble wrap application to key by company name. 
Prior to the integration, company names were unique 
within the bubble wrap customer base. The problem 
was created by the integration. 
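The failure can be reproduced in a few lines. This is a minimal sketch: the record layout and the numeric customer identifiers are assumptions, since the text does not say how the box company's database identifies customers.

```python
from collections import Counter

# Merged customer data; the "id" field is a hypothetical surrogate key.
customers = [
    {"id": 1, "name": "John Q. Fictional Company", "city": "Hanover"},
    {"id": 2, "name": "John Q. Fictional Company", "city": "Anchorage"},
]

# Keying by identifier keeps both customers distinct.
by_id = {c["id"]: c for c in customers}

# Keying by name: the later record silently replaces the earlier one.
by_name = {c["name"]: c for c in customers}


def ambiguous_names(records):
    """Names that are associated with more than one customer."""
    counts = Counter(r["name"] for r in records)
    return sorted(name for name, k in counts.items() if k > 1)


print(len(by_id), len(by_name))
print(by_name["John Q. Fictional Company"]["city"])
print(ambiguous_names(customers))
```

Neither index is wrong in isolation; the name-keyed lookup only misbehaves once data that permits duplicate names is fed into it.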

Abstractions themselves have abstractions, and these
are not immune to integration faults. For example, a
common abstraction of time-of-day (itself an abstraction)
constrains seconds to range between 0 and 59. An
artifact that embodied this assumption might integrate
successfully with many applications and operate for
years without failure. However, cognizant data modelers
are aware that an extra second, a "leap second," is
occasionally inserted into the Coordinated
Universal Time (UTC) time scale to keep it within
±0.9 s of the Universal Time (UT1) astronomical time
scale [20]. The time-of-day corresponding to the leap
second is represented as 23:59:60. So if the artifact that
constrains seconds to the range 0 to 59 is integrated
with any that propagate leap seconds, it might fail all of
a sudden one New Year's.
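Python's standard `datetime.time` type happens to embody this abstraction: it accepts seconds only in the range 0 to 59. The sketch below uses it to stand in for the artifact described above; the helper name `parse_time_of_day` is hypothetical.

```python
import datetime


def parse_time_of_day(text: str) -> datetime.time:
    """Parse HH:MM:SS; datetime.time rejects seconds outside 0..59."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return datetime.time(hours, minutes, seconds)


for stamp in ("23:59:59", "23:59:60"):
    try:
        print(stamp, "->", parse_time_of_day(stamp))
    except ValueError as exc:
        # Raised for the leap second representation 23:59:60.
        print(stamp, "-> rejected:", exc)
```

An application built on this helper would run without incident until a leap-second-aware source delivered 23:59:60.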

For any given abstraction, it is possible to construct 
an integration scenario in which a failure will occur 
because of some property that was not explicitly mod- 
eled. The need for the abstractions of S (e.g., customer 
name according to the box company) to take an explic- 
it stance with respect to properties that are relevant in T 
(e.g., uniquely identifying a customer) only arises 
when integration is attempted. Yet by virtue of numerous
undocumented and/or un-thought-about implementation
details, any realizations of these abstractions in
engineered artifacts such as software implicitly take
stances with respect to all properties. Simplistically,
one could say that when confronted with new properties,
either they work or they don't.



6. Semantic Faults 

"Semantic fault" is an informal term that can now be 
understood formally to mean a violation of the condi- 
tion expressed in Eq. (34). 

This section demonstrates how the semantic fault 
stories of Sec. 5 can be formalized. However, it is not 
necessarily the case that all semantic faults would 
emerge in exactly the same way. 

Logical statements below describe the behaviors of
the engineered artifacts as built unless preceded by the
doxastic qualifier $B_i$ (signifying a belief of the integrator,
$i$).

Consider the following: 



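$$A \in O(\beta) \tag{35}$$

$$a \in \alpha \supset Aa \tag{36}$$

$$B_i(a \in \alpha \rightarrow Aa) \vee B_i G(a \in \alpha \supset Aa) \vee B_i(\bar{A} \in O(\beta)) \tag{37}$$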

The integrator builds a complete mapping from $\alpha$ to $\beta$,

$$a \in \alpha \supset M(a) \in \beta \tag{38}$$

and the integrated system functions normally. Now 
assume that at some future time, individual x will be 
born such that 



$$F(x \in \alpha \wedge \bar{A}x) \tag{39}$$



Assuming that $a \in \alpha \supset M(a) \in \beta$ remains true,
conceptual integrity, Eq. (34), will then demand

$$F(\bar{A} \in O(\beta)) \tag{40}$$

which is not guaranteed. If the behavior of the engineered
artifact is instead described by $A \in N(\beta)$, then
$G(\bar{A} \notin O(\beta))$; the condition of Eq. (34) will be violated,
and there will be a semantic fault.

With the bindings shown in Table 1, the above models
the examples described in Sec. 5. In the first example,
individual $x$ is the customer name "John Q.
Fictional, Inc."; conceptual integrity fails because that
name is associated with more than one customer, which
is inconsistent with the customer name class as projected
from the bubble wrap application. In the second
example, individual $x$ is the time-of-day value
23:59:60; conceptual integrity fails because that
time-of-day has seconds outside the range 0...59, which is
inconsistent with the time-of-day class as represented
in the failing application.

Table 1. Bindings for Sec. 5 examples

Boxes
  $\alpha$      Customer name as projected from box company's database
  $\beta$       Customer name as projected from bubble wrap application
  $A$           Associated with exactly one customer
  $\bar{A}$     Not associated with exactly one customer

Time
  $\alpha$      Time-of-day as delivered by leap seconds cognizant time service
  $\beta$       Time-of-day as represented in some application
  $A$           Has seconds in range 0...59
  $\bar{A}$     Does not have seconds in range 0...59
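The temporal character of the fault can be sketched by replaying a growing population against a fixed $O(\beta)$. This is an illustration under assumptions: time is modeled as a list of snapshots, every member of the source class is taken to be mapped, and the property labels are hypothetical stand-ins for the Time bindings of Table 1.

```python
def first_fault(timeline, o_beta):
    """Return (time index, individual, property) for the first violation of
    the Eq. (34) condition, or None if no snapshot violates it.

    Every member of the source class is assumed to be mapped (cf. Eq. 38).
    """
    for t, population in enumerate(timeline):
        for individual, props in sorted(population.items()):
            for prop in sorted(props):
                if prop not in o_beta:
                    return t, individual, prop
    return None


# The application only admits seconds in 0..59.
O_BETA = {"seconds_in_0_59"}

# Snapshots of the source class as new time-of-day values come into existence.
timeline = [
    {"23:59:58": {"seconds_in_0_59"}},
    {"23:59:58": {"seconds_in_0_59"}, "23:59:59": {"seconds_in_0_59"}},
    {"23:59:58": {"seconds_in_0_59"}, "23:59:59": {"seconds_in_0_59"},
     "23:59:60": {"seconds_not_in_0_59"}},
]

print(first_fault(timeline, O_BETA))
```

The integration is consistent in the first two snapshots and faults only when the leap second value appears.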



7. Analogies to Ref. [3] 

Reference [3] defines essential, rigid, non-rigid, and 
anti-rigid as properties of properties. The definitions 
are made in terms of properties, individuals, and 
instances of properties (i.e., individuals that have that 
property). Classes as such are subsumed by properties 
that completely characterize them. 

• A property is essential to an individual if and only if 
it necessarily holds for that individual at every possi- 
ble time in every possible world. 

• A property is rigid if and only if, necessarily, it is 
essential to all of its instances. 

• A property is non-rigid if and only if it is not rigid. 

• A property is anti-rigid if and only if it is not essen- 
tial to any of its instances. 

The concern whether a property is essential to an 
individual is different from the concern whether a prop- 
erty is necessary for membership in a class. These two 
concerns may become inextricable when classes are 
defined intensionally (when the possession of a given 
set of properties strictly implies class membership), but 
they do not when class membership is primitive. This 
divergence makes it difficult to construct valid analo- 
gies between the content of this paper and that of Ref. 
[3], despite apparent similarities. 

Returning to the definitions of Sec. 4.1, one could
draw limited, perhaps strained, analogies. Given a class
$\alpha$ and a property $A$, one could say that $A$ is rigid within
$\alpha$ if $A \in N(\alpha)$, non-rigid if $A \in O(\alpha)$. But class-centered
analogs to essential and anti-rigid would require
an intensional viewpoint.



8. Relationship to Context Logic 

In works about context logic it is common to use the 
notation ist(c, p) to signify that proposition p is true in 
the context c [5]. That convention is adopted here. 

Context is broadly interpreted and can be used in lieu 
of many specialized modals. One can identify contexts 
corresponding to spans of time, a particular person's 
beliefs, etc. 

In the case of data integration, it is natural to identi- 
fy contexts corresponding to the schemata being inte- 
grated and then make assertions about what is true of 
various classes in those contexts. For example, if $C_1$ is
the context of a leap seconds cognizant time service, $C_2$
is the context of some application, and $x$ is "the" time-of-day
class, then one would write the following, or
something equivalent:

$$\mathit{ist}(C_1, \mathit{SecondsMayExceed59}(x)) \tag{41}$$

$$\mathit{ist}(C_2, \sim\mathit{SecondsMayExceed59}(x)) \tag{42}$$

"The" time-of-day class is an abstraction inherited 
from a common context, such as a global schema. Its 
specializations in contexts $C_1$ and $C_2$ disagree with
respect to the predicate SecondsMayExceed59. If the
reference to a common context is eliminated, then there
is no basis for discussion of whether the classes in $C_1$
and $C_2$ are compatible.
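A toy rendering of the ist notation shows where the common vocabulary enters. This is a sketch, not an implementation of any context logic: contexts are plain dictionaries, the predicate HasTimeZone is a hypothetical addition, and comparing entries by key already assumes that a predicate name means the same thing in both contexts.

```python
# Assertions made in each context, keyed by (predicate, subject).
CONTEXTS = {
    "C1": {("SecondsMayExceed59", "time-of-day"): True},
    "C2": {("SecondsMayExceed59", "time-of-day"): False,
           ("HasTimeZone", "time-of-day"): True},  # hypothetical extra predicate
}


def ist(context, predicate, subject):
    """Truth value asserted in the context, or None if the context is silent."""
    return CONTEXTS[context].get((predicate, subject))


def disagreements(c1, c2):
    """Assertions present in both contexts whose truth values differ."""
    shared = CONTEXTS[c1].keys() & CONTEXTS[c2].keys()
    return sorted(k for k in shared if CONTEXTS[c1][k] != CONTEXTS[c2][k])


print(ist("C1", "SecondsMayExceed59", "time-of-day"))
print(ist("C2", "SecondsMayExceed59", "time-of-day"))
print(ist("C1", "HasTimeZone", "time-of-day"))
print(disagreements("C1", "C2"))
```

Only the shared key can be compared at all; the predicate known to just one context yields no basis for agreement or disagreement.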

The model presented in this paper does not require 
that classes from a common context be made explicit. It 
does rely on the assumption that properties have equiv- 
alent meanings in the contexts of the systems being 
integrated. However, this is analogous to the assump- 
tion that predicates such as SecondsMayExceed59 have 
the same meaning in multiple contexts. 

Of course, there is nothing to prevent one from mak- 
ing logical statements about predicates in different con- 
texts; e.g., 

$$\mathit{ist}(C_2, \mathit{ValidTimestamp}(a) \supset \mathit{Seconds}(a) < 60) \tag{43}$$

But the problem repeats itself. Unless Seconds has a
common interpretation, nothing has been gained by 
contextualizing ValidTimestamp. 

Ultimately, to make comparisons between two contexts,
it is necessary to have some common vocabulary
with which to conduct the discussion. The problem can 
be moved around but cannot be eliminated. As always, 
"there is no silver bullet" [1], but a change in viewpoint 
can sometimes help. The goal is to move the problem to 
where it causes the least amount of damage. 



9. Conclusion 

A logical model of conceptual integrity in data inte- 
gration and a simple example application have been 
presented. Unlike constructive models that attempt to 
prevent semantic faults, this model allows both correct 
and incorrect integrations to be described. Imperfect 
legacy systems can therefore be modeled, allowing a 
more formal analysis of their flaws and the possible 
remedies. 

Future work to extend the model could focus on bet- 
ter treatment of several issues that were glossed over or 
minimized. 

• The important temporal dimension of conceptual
integrity could be explored in more detail and modeled
more precisely.

• The abstractions implicit in the act of integration
(pieces of an implicit "integration schema") could be
analyzed. A partial mapping from members of $\alpha$ to
members of $\beta$ suggests an abstraction from $\alpha$ and $\beta$
that describes that part of the population that is
"interesting" for the integration. A variant of formal
concept analysis [21] may be applicable, as may currently
evolving work on describing relations between
ontologies [22].

• "Fuzzy" properties (i.e., where $Aa$ is neither entirely
true nor false, or is not known with certainty to be
true; the different interpretations have different
ramifications) could be explored. Additional analysis is
needed to determine whether they add value. An infinite
set of Boolean properties may render fuzzy properties
redundant: if $Aa$ is only "sort of" true, then it
may be possible to derive a narrower "sub-property"
that is fully true and another one that is fully false. On
the other hand, it would be ill-advised to accept
philosophically vague properties [23,24], which defy
objective evaluation.